From: PHL Geo Inc. TO: Philadelphia City Council Subject: Introduction of a new development monitoring tool
Dear City Council Members,
We are PHL Geo Inc., a Philadelphia based civil-service oriented coding organization using the power of geospatial analysis and city-owned data to provide informed insights to decision makers such as yourself. We are reaching out to introduce an exciting new tool which can be used in your districts immediately. The tool is called the GENTRISK and uses 10 years of development information to predict where development may happen in your community in the near future. The GENTRISK may be used in your district to manage the negative impacts of development, including trash dumping, street closures, and noise complaints. Getting ahead of development impacts can improve coordination and communication with residents in your district, improve the likelihood of successful permit applications, and enable smart zoning and growth in your city council region.
Development is a process determined by numerous factors and can have positive effects; however, development-induced gentrification, which is “a process a neighborhood change that includes economic change in a historically disinvested neighborhood - by means of real estate investment and new higher-income residents moving in - as well as demographic change - not only in terms of income level, but also in terms of changes in the education level or racial make-up of residents,” can have negative effects (Demsas, 2021). As city council members looking to promote econoimc development while ensuring that residents have the protections that they need, the GENTRISK can provide tools that will help you see where development pressures are happening and engage in proactive outreach to those communities. Aligning city services such as increaseing access to Philadelhpia’s Homestead Exemption or to the PA Homeowner Assistance fund, can reduce these negative impacts.
The GENTRISK uses information from the past 10 years of development and combines that with external factors including the distance to schools, tree density, and distance to public transit, in order to understand what is driving development in Philadelphia. The GENTRISK then uses the relationships between those factors to predict where development may occur so you can get ahead of any potential negative impacts. These predictions form the basis of several tools available in GENTRISK, such as scenario planning. In the scenario planning framework, you may modify inputs to see how they will impact development going forward. For example, if the new Philly Tree Plan is likely to bring more tree canopy to your community, you can use the GENTRISK to model how much further development is likely to occur.
The following document provides a technical overview of how the GENTRISK functions, how to interpret its results, and how it can be incorporated into your city planning techniques. We look forward to seeing GENTRISK integrated into your city planning toolkit.
For questions, comments, concerns, or to learn more, please email phlgeo@gmail.com
Kindly, Akira DiSandro Benjamin Myers
PHL GEO Inc. Founders
The GENTRISK tool uses the construction of new housing and commercial buildings as its core output, as construction provides both opportunities (in the form of economic development and increased tax base) and pressures (in the form of gentrification and noise/trash) on residents and neighborhoods. Construction information is available from the Department of Licenses and Inspection for a time period stretching over the last 10 years. While neighborhoods are constantly in a state of change, focusing in on a more specific time period can provide City Council with an accurate understanding of what is driving development in the city. THe time period from 2013 to 2018 provides a good baseline of understanding development drivers as it is prior to the impacts that the COVID-19 global pandemic had on development drivers, yet recent enough that certain key drivers - such as the turnover of permanent residents for renters - are still prevalent within Philadelphia.
The model powering GENTRISK requires that all of the variables are gathered in the same format and then compared across time. To facilitatate this process, the PHL GEO Inc. team used the following code to gather all of the variables and put them into a similar format. These variables can then provide an understnaidng of how neighborhoods are changing over time, followed by an investigation of what could be driving those changes; ie, if development is happening in a neighborhood, what are the factors that drive that development?
A key way to contextualize these changes is to look at the changes that are happening within city council ditsrcts throughout Philadelphia. AS is shown through the map below, there are 10 city council districts in Philadelphia where changes and development pressures will have to be accoutned for.
# philly permit data
dat_permit <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+permits&filename=permits&format=geojson&skipfields=cartodb_id")
# only keep new construction permits
newcon_permits <- dat_permit %>%
filter(grepl("NEW CON|NEWCON",typeofwork)) %>%
mutate(year = substr(permitissuedate, 1,4)) %>%
filter(year %in% c(2013:2019,2022)) %>%
st_transform(crs = 2272)
rm(dat_permit) # remove this big file because we won't use this anymore
city_council <- st_read("https://opendata.arcgis.com/api/v3/datasets/1ba5a5d68f4a4c75806e78b1d9245924_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1")
city_council <- city_council %>%
mutate(DISTRICT = factor(DISTRICT, levels = 1:10))
# census data
acs_variable_list.2019 <- load_variables(2019, #year
"acs5", #five year ACS estimates
cache = TRUE) %>%
filter(geography == "block group")
# variables of interest:
# B19013_001: medHHincome
# B01001_001: total pop
# B02001_002: white pop
# B25058_001: median rent
# B15003_022: attainment of bachelor's of population 25+
# B25071_001: Median gross rent as a percentage of household income (past year)
# B07201_001: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, estimated total
# B07201_002: Geographical Mobility in the Past Year for Current Residence--Metropolitan Statistical Area Level in the United States, same house 1 year ago
# B25008_002: total population in occupied housing by tenure, owner occupied
# B25008_003: total population in occupied housing by tenure, renter occupied
# B99082_001: allocation of private vehicle occupancy
# B25044_001: tenure by vehicles available
# B09002_001: own children under 18
# B11004_001: related children under 18
census_vars <- paste0(c("B01001_001", "B19013_001", "B02001_002", "B25058_001", "B15003_022", "B25071_001", "B07201_001",
"B07201_002", "B25008_002", "B25008_003", "B99082_001", "B25044_001", "B09002_001",
"B11004_001"),"E") # census variables of interest that are available
medHHinc2019 <- 70459 # source: https://www.deptofnumbers.com/income/pennsylvania/philadelphia/
tracts19 <-
get_acs(geography = "block group",
variables = census_vars,
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(total_pop = B01001_001E,
med_hh_inc = B19013_001E,
white_pop = B02001_002E,
med_rent = B25058_001E,
bachelors25 = B15003_022E,
pct_rent_hhinc = B25071_001E,
mobility_tot_metro = B07201_001E,
samehouse1yr_metro = B07201_002E,
owner_occ = B25008_002E,
renter_occ = B25008_003E,
private_vehicle_occ = B99082_001E,
vehicles_avail = B25044_001E,
) %>%
mutate(children = B09002_001E + B11004_001E,
year = "2019") %>%
dplyr::select(!matches("^B|white_pop",ignore.case = F)) %>%
st_centroid()
# philly neighborhood data
nhoods_path <- 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
nhoods <- st_read(nhoods_path, quiet = T) %>%
st_transform(crs = 2272) %>%
dplyr::select(mapname)
# philly bounds
philly <- st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
st_transform(crs = 2272) %>%
dplyr::select(OBJECTID,geometry)
# limit permit dat to philly border
newcon_permits <- newcon_permits %>%
.[philly,]
# 311 incident reports data
carto_url = "https://phl.carto.com/api/v2/sql"
# Crime incidents
table_name_311 = "public_cases_fc"
# query
where_311 = "closed_datetime >= '2010-01-01' AND closed_datetime < '2020-01-01' AND service_name IN ('Rubbish/Recyclable Material Collection','Illegal Dumping','Construction Complaints', 'Building Construction', 'Sanitation Violation', 'Street Trees', 'Dangerous Sidewalk', 'Homeless Encampment Request', 'Parks and Rec Safety and Maintenance', 'Vacant House or Commercial')"
query311 = paste("SELECT *",
"FROM", table_name_311,
"WHERE", where_311)
reports311 = rphl::get_carto(query311, format = "csv", base_url = carto_url, stringsAsFactors = F)
reports311 <- reports311 %>%
mutate(Year = year(format.Date(reports311$closed_datetime)))
reports311 <- reports311 %>%
filter(!is.na(lon) & !is.na(lat)) %>%
st_as_sf(coords = c("lon","lat"), crs = 4326) %>%
st_transform(crs = 2272) %>%
mutate(
type = str_replace_all(service_name, "[^[:alnum:]]", ""),
Legend = "311") %>%
dplyr::select(Legend, type, geometry, Year)
# plot city council districts
ggplot() +
geom_sf(data=city_council, aes(fill=DISTRICT)) +
scale_fill_viridis(discrete = T) +
labs(fill = "City Council District",
title = "City Council Districts Within Philadelphia",
caption = "Figure 1.") +
mapTheme()
Development patterns may be spread over space, such that it forms a smooth ‘mosaic’ across the city of Philadelphia. Alternatively, it is possible that several permits may be clustered together, either because they were all submitted as part of the same construction project (ie, submitting an electrical permit and roofing permit for the same property) or there were several units being worked in tandem. To better understand whether Philadelphia is experiencing a ‘mosaic’ of development or if development has been clustered into particular areas, a fishnet was established for grid cells of 500 meter by 500 meters across the city. The 500 meter scale (1/3rd mile) allows for a fluid understanding of development patterns at the block scale.
Figure 1 below shows the number of permits granted for each fishnet square in 2014, followed by Figure 2 showing the number of permits for each fishnet square in 2015, and so on until Figure 6 demonstrating the number of permits granted for each fishnet square in 2019. As is visible through the maps, trends and patterns of development shift across the city, such that in 2014 development centered around the Center City Neighborhood, while moving towards 2019 - when record number of permits were granted for the time period in focus - development was centered in Center City as well as North Philadelphia.
It is worthwhile to note that while there are a number of areas with extensive permit counts and development pressures, so too are there a significant number of fishnet cells in the city that did not experience much - if any - development. These low-no development areas, from the perspective of new construction, could have an outside influence on the model with impacts on its accuracy in areas experiecing higher levels of development. This was factored into the model through the use of a Poisson distribution, described in greater detail below.
# our CRS is in feet, but want to make fishnet a 500m x 500m
cell_size <- 500 * 3.28084
fishnet <-
st_make_grid(philly,
cellsize = cell_size,
square = TRUE,
crs = 2272) %>%
.[philly] %>%
st_sf() %>%
mutate(uniqueID = 1:n())
# 2013
{
permit_net13 <-
dplyr::select(newcon_permits %>% filter(year == "2013")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits13 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE))
}
# 2014
{
permit_net14 <-
dplyr::select(newcon_permits %>% filter(year == "2014")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits14 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# 2015
{
permit_net15 <-
dplyr::select(newcon_permits %>% filter(year == "2015")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits15 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# 2016
{
permit_net16 <-
dplyr::select(newcon_permits %>% filter(year == "2016")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits16 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# 2017
{
permit_net17 <-
dplyr::select(newcon_permits %>% filter(year == "2017")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits17 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# 2018
{
permit_net18 <-
dplyr::select(newcon_permits %>% filter(year == "2018")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits18 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# 2019
{
permit_net19 <-
dplyr::select(newcon_permits %>% filter(year == "2019")) %>%
mutate(count_permits = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count_permits19 = replace_na(count_permits, 0),
uniqueID = as.numeric(rownames(.)),
cvID = sample(round(nrow(fishnet) / 16), size=nrow(fishnet), replace = TRUE)) %>%
dplyr::select(-count_permits)
}
# combine permit counts from all years
permit_net_allyrs <- permit_net13 %>%
st_drop_geometry() %>%
dplyr::select(uniqueID,cvID,count_permits13) %>%
left_join(permit_net14 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits14), by = "uniqueID") %>%
left_join(permit_net15 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits15), by = "uniqueID") %>%
left_join(permit_net16 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits16), by = "uniqueID") %>%
left_join(permit_net17 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits17), by = "uniqueID") %>%
left_join(permit_net18 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits18), by = "uniqueID") %>%
left_join(permit_net19 %>% st_drop_geometry() %>% dplyr::select(uniqueID,count_permits19), by = "uniqueID") %>%
left_join(permit_net13 %>% dplyr::select(-c(cvID,count_permits13,count_permits)), by = "uniqueID")
# map of permit count over time
permit_net_formap <- permit_net_allyrs %>%
dplyr::select(-c(uniqueID,cvID,count_permits14,count_permits16,count_permits18)) %>%
rename(`Permit Count 2013` = count_permits13,
`Permit Count 2015` = count_permits15,
`Permit Count 2017` = count_permits17,
`Permit Count 2019` = count_permits19) %>%
gather("count_permits", "value", -geometry) %>%
st_as_sf()
permit_count_map <- ggplot(data = permit_net_formap) +
geom_sf(aes(fill = value)) +
scale_fill_viridis() +
facet_wrap(~count_permits) +
labs(title = "Count of New Construction Permits for the fishnet",
caption = "Figure 2.") +
mapTheme(title_size = 14) + theme(legend.position="bottom")
permit_count_map
There are a number of different methods to account for neighborhood change and the impacts of development, all of which are powered. The GENTRISK takes into account several census features, which are described below (along with their reasoning)
The model powering GENTRISK requires that all of the variables are gathered in the same format and then compared across time. To facilitatate this process, the PHL GEO Inc. team used the following code to gather all of the variables and put them into a similar format. These variables can then provide an understnaidng of how neighborhoods are changing over time, followed by an investigation of what could be driving those changes; ie, if development is happening in a neighborhood, what are the factors that drive that development?
# function to make fishnet from acs dataframe (dat_tract), for variable (var_name), giving it the legend (var_name), aggregating with function (sum_or_mean)
# for example, to make a fishnet summing all total_pop variables for 2019, one would define the variables as follows:
# dat_tract <- tracts19; var_name <- "total_pop"; sum_or_mean <- sum
makenet <- function(dat_tract, var_name, sum_or_mean){
net_tract <- dat_tract %>%
.[philly,] %>%
dplyr::select(matches(var_name)) %>%
mutate(Legend = var_name) %>%
st_join(fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = sum_or_mean(get(var_name), na.rm = T)) %>%
left_join(fishnet, ., by = "uniqueID") %>% # add geometry back in
spread(Legend, count, fill=0) %>% # fill in ones where fishnet was missing, count was NA with 0
dplyr::select(-`<NA>`) %>%
ungroup()
return(net_tract)
}
var_names <- names(tracts19)[3:14]
# for loop to create fish nets for all ACS variables for all years
for (i in 1:length(var_names)){
# select year, variable, and legend name
var_name <- var_names[i]
if (i %in% c(2:4,18)){
sum_or_mean <- mean
} else {
sum_or_mean <- sum
}
# run function to make fishnet
net_tract <- makenet(tracts19,var_name,sum_or_mean)
# save as individual fishnet, uncomment if neededd
# net_tract_name <- paste0("net_",var_name,yr)
# assign(net_tract_name, net_tract)
# save all variables into one big net
if (i == 1) {
census_net <- net_tract %>%
st_drop_geometry()
} else {
net_tract_tojoin <- net_tract %>%
st_drop_geometry() %>%
dplyr::select(1:2)
census_net <- left_join(census_net, net_tract_tojoin, by = "uniqueID")
}
}
long_15 <- reports311 %>%
filter(Year == 2015) %>%
arrange(type) %>%
mutate(Legend = paste0(type, Year), .keep = "unused")
long_16 <- reports311 %>%
filter(Year == 2016) %>%
arrange(type) %>%
mutate(Legend = paste0(type, Year), .keep = "unused")
long_17 <- reports311 %>%
filter(Year == 2017) %>%
arrange(type) %>%
mutate(Legend = paste0(type, Year), .keep = "unused")
long_18 <- reports311 %>%
filter(Year == 2018) %>%
arrange(type) %>%
mutate(Legend = paste0(type, Year), .keep = "unused")
long_19 <- reports311 %>%
filter(Year == 2019) %>%
arrange(type) %>%
mutate(Legend = paste0(type, Year), .keep = "unused")
vars311_net <- rbind(long_15, long_16, long_17, long_18, long_19) %>%
st_join(fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
left_join(fishnet, ., by = "uniqueID") %>% # add geometry back in
spread(Legend, count, fill=0) %>% # fill in ones where fishnet was missing, count was NA with 0
dplyr::select(-`<NA>`) %>%
ungroup()
# use something like code below (from lab 6) to create nearest neighbor counts
# convenience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
vars311_net_ccoid <- st_c(st_coid(vars311_net))
# add nearest neighbor features
vars311_net <- vars311_net %>%
mutate(buildingCon15.dist = nn_function(vars311_net_ccoid,
st_c(long_15 %>% filter(Legend == "BuildingConstruction2015")), 3),
dangerSide15.dist = nn_function(vars311_net_ccoid,
st_c(long_15 %>% filter(Legend == "DangerousSidewalk2015")), 3),
illegalDump15.dist = nn_function(vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "IllegalDumping2015")), 3),
parksNrec15.dist = nn_function(vars311_net_ccoid,
st_c(long_15 %>% filter(Legend == "ParksandRecSafetyandMaintenance2015")), 3),
materialColl15.dist = nn_function(
vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "RubbishRecyclableMaterialCollection2015")), 3),
streetTrees15.dist = nn_function(vars311_net_ccoid, st_c(long_15 %>% filter(Legend == "StreetTrees2015")), 3),
vacant15.dist = nn_function(vars311_net_ccoid,
st_c(long_15 %>% filter(Legend == "VacantHouseorCommercial2015")), 3),
buildingCon16.dist = nn_function(vars311_net_ccoid,
st_c(long_16 %>% filter(Legend == "BuildingConstruction2016")), 3),
dangerSide16.dist = nn_function(vars311_net_ccoid,
st_c(long_16 %>% filter(Legend == "DangerousSidewalk2016")), 3),
illegalDump16.dist = nn_function(vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "IllegalDumping2016")), 3),
parksNrec16.dist = nn_function(vars311_net_ccoid,
st_c(long_16 %>% filter(Legend == "ParksandRecSafetyandMaintenance2016")), 3),
materialColl16.dist = nn_function(
vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "RubbishRecyclableMaterialCollection2016")), 3),
streetTrees16.dist = nn_function(vars311_net_ccoid, st_c(long_16 %>% filter(Legend == "StreetTrees2016")), 3),
vacant16.dist = nn_function(vars311_net_ccoid,
st_c(long_16 %>% filter(Legend == "VacantHouseorCommercial2016")), 3),
buildingCon17.dist = nn_function(vars311_net_ccoid,
st_c(long_17 %>% filter(Legend == "BuildingConstruction2017")), 3),
dangerSide17.dist = nn_function(vars311_net_ccoid,
st_c(long_17 %>% filter(Legend == "DangerousSidewalk2017")), 3),
illegalDump17.dist = nn_function(vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "IllegalDumping2017")), 3),
parksNrec17.dist = nn_function(vars311_net_ccoid,
st_c(long_17 %>% filter(Legend == "ParksandRecSafetyandMaintenance2017")), 3),
materialColl17.dist = nn_function(
vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "RubbishRecyclableMaterialCollection2017")), 3),
streetTrees17.dist = nn_function(vars311_net_ccoid, st_c(long_17 %>% filter(Legend == "StreetTrees2017")), 3),
vacant17.dist = nn_function(vars311_net_ccoid,
st_c(long_17 %>% filter(Legend == "VacantHouseorCommercial2017")), 3),
buildingCon18.dist = nn_function(vars311_net_ccoid,
st_c(long_18 %>% filter(Legend == "BuildingConstruction2018")), 3),
dangerSide18.dist = nn_function(vars311_net_ccoid,
st_c(long_18 %>% filter(Legend == "DangerousSidewalk2018")), 3),
illegalDump18.dist = nn_function(vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "IllegalDumping2018")), 3),
parksNrec18.dist = nn_function(vars311_net_ccoid,
st_c(long_18 %>% filter(Legend == "ParksandRecSafetyandMaintenance2018")), 3),
materialColl18.dist = nn_function(
vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "RubbishRecyclableMaterialCollection2018")), 3),
streetTrees18.dist = nn_function(vars311_net_ccoid, st_c(long_18 %>% filter(Legend == "StreetTrees2018")), 3),
vacant18.dist = nn_function(vars311_net_ccoid,
st_c(long_18 %>% filter(Legend == "VacantHouseorCommercial2018")), 3),
buildingCon19.dist = nn_function(vars311_net_ccoid,
st_c(long_19 %>% filter(Legend == "BuildingConstruction2019")), 3),
dangerSide19.dist = nn_function(vars311_net_ccoid,
st_c(long_19 %>% filter(Legend == "DangerousSidewalk2019")), 3),
illegalDump19.dist = nn_function(vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "IllegalDumping2019")), 3),
parksNrec19.dist = nn_function(vars311_net_ccoid,
st_c(long_19 %>% filter(Legend == "ParksandRecSafetyandMaintenance2019")), 3),
materialColl19.dist = nn_function(
vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "RubbishRecyclableMaterialCollection2019")), 3),
streetTrees19.dist = nn_function(vars311_net_ccoid, st_c(long_19 %>% filter(Legend == "StreetTrees2019")), 3),
vacant19.dist = nn_function(vars311_net_ccoid,
st_c(long_19 %>% filter(Legend == "VacantHouseorCommercial2019")), 3))
All census variables were pulled at the block group level, which is a smaller scale than our fishnet. There is no perfect way to join block-group-level census data to our fishnet, so our approach was to take the centroid of the block groups and aggregated the variables (by either taking the sum or average) of the block groups whose centroids landed in a fishnet grid cell (idk how to phrase this well so feel free to rephrase!). As a result, some fishnet cells that cover residential areas of Philadelphia may have a value of 0 for variables like “Total Population” since no block group centroid was within the bounds of that cell.
# join census vars fishnet to permit count net + 311 net
all_net <- left_join(permit_net_allyrs, census_net, by = "uniqueID") %>%
left_join(vars311_net %>% st_drop_geometry(), by = "uniqueID") %>%
st_as_sf()
# making this moreso to see which would actually be good predictors
var_for_model <- c("uniqueID","cvID","count_permits19","count_permits18","count_permits17","count_permits16",
"count_permits15","BuildingConstruction2019","IllegalDumping2019","DangerousSidewalk2019",
"RubbishRecyclableMaterialCollection2019","StreetTrees2019","parksNrec19.dist",
"VacantHouseorCommercial2019","vehicles_avail","med_rent","renter_occ","total_pop",
"samehouse1yr_metro","bachelors25","children")
all_net19 <- all_net %>%
dplyr::select(all_of(var_for_model))
# only keep years for count_permit columns
names(all_net19) <- c(str_replace_all(var_for_model, c("2019" = "","19." = ".")), "geometry")
for_cormat19 <- all_net19 %>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID,cvID))
ggcorrplot(
round(cor(for_cormat19), 1),
p.mat = cor_pmat(for_cormat19),
colors = c("#4b2875", "white", "#9c1339"),
type="lower",
insig = "blank",
digits = 4,
lab = T, lab_size = 2) +
labs(title = "Correlation Matrix for 2019 fishnet",
caption = "Figure 3.")
We also added some features that explain the spatial process of permit issuance in Philadelphia. Using local Moran’s I, we examined the spatial process which helps us understand whether the permit count at a certain location is randomly distributed or clustered relative to its immediate neighbors.
Since the outcome we are interested in is a count or sum of vandalism incidents, I use a Poisson regression model to estimate the outcome. I created two types of models – one with just risk factors as the predictors and another with both risk factors and spatial process (I ultimately only kept the model with both risk factors and spatial process).
all_net19 %>% ggplot(aes(x = count_permits19)) +
geom_histogram(bins = 66, fill = viridis::cividis(1)) +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)) +
labs(title = "Permit Distribution",
x = "Count of New Construction Permits", y = "Count",
caption = "Figure 4.")
To assess the validity of my models I look at the accuracy (how accurate are my predictions in the set that the models are trained on?) and more importantly, generalizability (can I predict counts of vandalism across different neighborhoods? For another year?).
To assess accuracy of my models, I examined the mean absolute error (MAE; where error = predicted value - observed value) in the dataset containing vandalism incidents for 2021 – the same data we used to train the model.
The most important validity measure of a geospatial risk prediction model is the generalizability as we want to be able to train a model and predict future outcomes, in our case counts of vandalism. I use random k-fold (with k ~ 100) cross validation as well as ‘Leave-one-group-out’ cross-validation (LOGO-CV) to assess model generalizability. LOGO-CV helps us assess model generalizability across neighborhoods. I also predict vandalism risk scores for the following year (2022) to assess whether the model generalizes across time using the 2021 kernel density.
[description of figure 4]
# add neighborhood names
allnet19_formodel <- all_net19 %>%
st_centroid() %>%
st_join(dplyr::select(nhoods, mapname)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(all_net19, geometry, uniqueID), by = "uniqueID") %>%
st_sf()
vars_net.long <- gather(allnet19_formodel %>% dplyr::select(-count_permits19),
variable, value, -geometry, -uniqueID, -cvID, -mapname)
vars <- unique(vars_net.long$variable)
varList <- list()
for (i in vars) {
varList[[i]] <- ggplot() +
geom_sf(data = filter(vars_net.long, variable == i),aes(fill = value), colour=NA) +
scale_fill_viridis(name = "") +
labs(title = i) +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 7),
legend.position = "bottom",
legend.text = element_text(size = 5))
}
do.call(grid.arrange,c(varList, ncol = 3, top = "Predictors of Permit Count (on fishnet)", bottom = "Figure 5."))
[description of figure 6]
final_net.nb <- poly2nb(as_Spatial(allnet19_formodel), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE) # turn neighborhood weights into list of weights
local_morans <- localmoran(allnet19_formodel$count_permits19, final_net.weights, zero.policy=TRUE) %>%
as.data.frame() # Ii moran's I at ith cell, Ei expected/mean from neighbors
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(allnet19_formodel)) %>%
st_sf() %>%
dplyr::select(`Permit Count` = count_permits19,
`Local Morans I` = Ii,
`P Value` = `Pr(z != E(Ii))`) %>%
mutate(`Significant Hotspots` = ifelse(`P Value` <= 0.001, 1, 0)) %>%
gather(variable, value, -geometry)
# now plot
vars <- unique(final_net.localMorans$variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, variable == i),aes(fill = value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")
}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Moran's I Statistics for Permit Count in Philadelphia",
bottom = "Figure 6."))
final_net <-
allnet19_formodel %>%
mutate(permitct.isSig =
ifelse(localmoran(allnet19_formodel$count_permits19,
final_net.weights)[,5] <= 0.001, 1, 0)) %>%
mutate(permitct.isSig.dist =
nn_function(st_coordinates(st_centroid(allnet19_formodel)),
st_coordinates(st_centroid(
filter(allnet19_formodel, permitct.isSig == 1))), 1))
[description of figure 6]
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -mapname) %>%
gather(variable, value, -count_permits19)
correlation.cor <-
correlation.long %>%
group_by(variable) %>%
summarize(correlation = cor(value, count_permits19, use = "complete.obs"))
ordered_vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","total_pop","bachelors25",
"children","med_rent","renter_occ","samehouse1yr_metro","vehicles_avail","BuildingConstruction",
"IllegalDumping","DangerousSidewalk","StreetTrees","RubbishRecyclableMaterialCollection",
"VacantHouseorCommercial","parksNrec.dist","permitct.isSig","permitct.isSig.dist")
ggplot(correlation.long %>%
mutate(variable = factor(variable, levels = ordered_vars)),
aes(value, count_permits19)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor %>%
mutate(variable = factor(variable, levels = ordered_vars)),
aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~variable, ncol = 3, scales = "free") +
labs(title = "Permit count as a function of predictors",
caption = "Figure 7.") +
theme(text = element_text( color = "black"),
plot.title = element_text(size = 14, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 10))
# just risk factors
reg.vars <- c("count_permits18","count_permits17","count_permits16","count_permits15","BuildingConstruction",
"IllegalDumping","DangerousSidewalk","RubbishRecyclableMaterialCollection","vehicles_avail",
"med_rent","renter_occ","VacantHouseorCommercial","total_pop","samehouse1yr_metro","bachelors25",
"StreetTrees","children","parksNrec.dist")
# random k-fold CV
reg.CV <- crossValidate(dataset = final_net,
id = "cvID",
dependentVariable = "count_permits19",
indVariables = reg.vars) %>%
dplyr::select(cvID, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.MAE <- mean(abs(reg.CV$error)) # ~7.573239
# with local Moran's I spatial process features
reg.sp.vars <- c(reg.vars,"permitct.isSig","permitct.isSig.dist")
## RUN REGRESSIONS
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "count_permits19",
indVariables = reg.sp.vars) %>%
dplyr::select(cvID, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.spatial.MAE <- mean(abs(reg.spatialCV$error)) # ~5.595413
# LOGO-CV
# needed to redefine crossValidate to solve the issue with a neighborhood of NA
# this function also works for the previous lines using crossValidate()
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
# cvID_list <- unique(dataset[[id]])
cvID_list <- as.character(na.omit(unique(dataset[[id]])))
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
form <- as.formula(form_parts)
regression <- glm(form, family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
# adding neighborhood for LOGO CV, risk factors only
reg.logoCV <- crossValidate(
dataset = final_net,
id = "mapname",
dependentVariable = "count_permits19",
indVariables = reg.vars) %>%
dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.logo.MAE <- mean(abs(reg.logoCV$error)) # 8.273702
# risk factors + spatial process
reg.logo.spatialCV <- crossValidate(
dataset = final_net,
id = "mapname",
dependentVariable = "count_permits19",
indVariables = reg.sp.vars) %>%
dplyr::select(cvID = mapname, count_permits19, Prediction, geometry) %>%
mutate(error = count_permits19 - Prediction)
# MAE
reg.logo.spatial.MAE <- mean(abs(reg.logo.spatialCV$error)) # 6.170643
[description of figure 8 & 9]
reg.summary <- rbind(
mutate(reg.CV,
Error = Prediction - count_permits19,
Regression = "Random k-fold CV: Predictors"),
mutate(reg.spatialCV,
Error = Prediction - count_permits19,
Regression = "Random k-fold CV: Predictors with Spatial Process"),
mutate(reg.logoCV,
Error = Prediction - count_permits19,
Regression = "Spatial LOGO-CV: Predictors"),
mutate(reg.logo.spatialCV,
Error = Prediction - count_permits19,
Regression = "Spatial LOGO-CV: Predictors with Spatial Process")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - count_permits19, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
# make map
# adjust mapTheme in order to fit title
mapTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 7))
}
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Errors by Cross Validation method",
caption = "Figure 8.") +
mapTheme(title_size = 14) + theme(legend.position="bottom")
Here, we show the real vs predicted counts of new construction permits in 2019 (see figure 9).
reg.summary %>%
filter(Regression == "Random k-fold CV: Predictors with Spatial Process") %>%
dplyr::select(-c(cvID,error,Error)) %>%
rename(`Real Permit Count` = count_permits19) %>%
gather("counts","value",-geometry,-Regression) %>%
ggplot() +
geom_sf(aes(fill = value)) +
scale_fill_viridis() +
facet_wrap(~counts) +
labs(title = "Real vs Predicted Counts of Permits (2019)",
caption = "Figure 9.") +
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 10),
legend.position = "bottom")
# adjust plotTheme to fit titles
plotTheme <- function(base_size = 12, title_size = 16) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 10)
)
}
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 450, by = 50)) +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count",
caption = "Figure 10.") +
plotTheme(title_size = 14) + theme(legend.position="bottom")
[description of table 1] Table 1 below shows the summary of regressions, including the mean and standard deviation of MAE for both of the models and both of the cross validation methods. We highlighted the row showing the results of both cross validation methods for the model including spatial process, to make clear how much these spatial process features improved our model.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(`Mean MAE` = round(mean(MAE), 2),
`SD MAE` = round(sd(MAE), 2)) %>%
kable(caption = "Table 1: Summary of Regressions") %>%
kable_styling("striped", full_width = F) %>%
row_spec(c(2,4), bold = T, color = "white", background = viridis::cividis(1)) %>%
kable_classic(full_width = F, html_font = "Cambria")
| Regression | Mean MAE | SD MAE |
|---|---|---|
| Random k-fold CV: Predictors | 4.49 | 12.85 |
| Random k-fold CV: Predictors with Spatial Process | 2.76 | 4.07 |
| Spatial LOGO-CV: Predictors | 10.08 | 31.36 |
| Spatial LOGO-CV: Predictors with Spatial Process | 6.74 | 14.61 |
In order to assess if the model generalizes to different neighborhood contexts, we tested whether the model generalizes to a racial context. Using 2019 US Census data, I defined a neighborhood to be “Majority White” if over 50% of the population was White, and “Majority non-White” otherwise.
[description of table 2]
RaceContext <- get_acs(geography = "block group",
variables = c("B01001_001E","B02001_002E"),
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(total_pop = B01001_001E,
white_pop = B02001_002E) %>%
mutate(pct_white = ifelse(total_pop > 0, white_pop / total_pop,0),
RaceContext = ifelse(pct_white > 0.5, "Majority White",
ifelse(total_pop != 0 , "Majority non-White", NA))) %>%
dplyr::select(-c(NAME,total_pop,white_pop,pct_white)) %>%
.[nhoods,]
reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(RaceContext) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, RaceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(RaceContext, mean.Error) %>%
kable(caption = "Table 2. Mean Error by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F) %>%
kable_classic(html_font = "Cambria")
| Regression | Majority non-White | Majority White |
|---|---|---|
| Spatial LOGO-CV: Predictors | 2.9338193 | 0.4763929 |
| Spatial LOGO-CV: Predictors with Spatial Process | 0.4727099 | 0.1580031 |
Similarly, we tested the generalizability of our model in an income context.
IncomeContext <- get_acs(geography = "block group",
variables = c("B19013_001E"),
year = 2019, state = 42,
geometry = T, output = "wide") %>%
st_transform(crs = 2272) %>%
dplyr::select(!matches("M$")) %>%
rename(med_hh_inc = B19013_001E) %>%
mutate(IncomeContext = ifelse(med_hh_inc > medHHinc2019, "Above Median Income", "Below Median Income")) %>%
dplyr::select(-c(NAME,med_hh_inc)) %>%
.[nhoods,]
reg.summary %>%
filter(grepl("LOGO", Regression)) %>%
st_centroid() %>%
st_join(IncomeContext) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, IncomeContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(IncomeContext, mean.Error) %>%
kable(caption = "Table 3. Mean Error by Neighborhood Income Context") %>%
kable_styling("striped", full_width = F) %>%
kable_classic(html_font = "Cambria")
| Regression | Above Median Income | Below Median Income |
|---|---|---|
| Spatial LOGO-CV: Predictors | 0.2946100 | 2.9462298 |
| Spatial LOGO-CV: Predictors with Spatial Process | 0.3275409 | 0.7222918 |
In conclusion….